Skip to content

Add fast mb_strcut implementation for UTF-8 #12337

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

alexdowad
Copy link
Contributor

The old implementation decodes the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.)

At the same time, I also added many more unit tests for mb_strcut. This will help to avoid unintended behavior changes as the function undergoes further performance work.

The new implementation behaves identically to the old one on valid UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: the old implementation would convert invalid UTF-8 byte sequences to error markers ('?'), but the new implementation just cuts a subsequence of bytes out of the source string without performing any conversion on it.

Any comments will be much appreciated! @Girgias @cmb69 @youkidearitai @kamil-tekiela

@alexdowad
Copy link
Contributor Author

Maybe @iluuu1994 might have some comment?

@youkidearitai
Copy link
Contributor

I find different behavior. Is this expected?

$ sapi/cli/php -r 'var_dump(mb_strcut("あ\x80う", 3, 3));'
string(0) ""

But now available versions: https://3v4l.org/edcf7

string(1) "�"

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable? One thing which might make it easier to check is to have the new test be in a prior commit with the current behaviour, and then have the feature as a follow-up.

Might want the opinions of @youkidearitai tho

@alexdowad
Copy link
Contributor Author

I find different behavior. Is this expected?

Hi, @youkidearitai. Yes, this is expected: if you look again at the commit message, it mentions that the new code has identical results on valid UTF-8 strings, but that is an invalid UTF-8 string.

The explanation is exactly as written above in the description of this PR.

@alexdowad
Copy link
Contributor Author

@youkidearitai BTW, I find it very impressive how you are (usually) able to discover interesting test cases for new code.

@alexdowad
Copy link
Contributor Author

Seems reasonable? One thing which might make it easier to check is to have the new test be in a prior commit with the current behaviour, and then have the feature as a follow-up.

Good idea, I can do that.

@youkidearitai
Copy link
Contributor

@alexdowad Thank you for reply.
My understanding is invalid UTF-8 that it is remove. However, 0xF5 (it will never come out of UTF-8) is returned as it is.

$ sapi/cli/php -r 'var_dump(bin2hex(mb_strcut("あ\xf5う", 3, 3)));'
string(2) "f5"

3v4l: https://3v4l.org/22jKX

Should this be returned as 0xF5 (and larger) as is?

@alexdowad
Copy link
Contributor Author

@youkidearitai Thanks... I have just run your test cases in a debugger to see what is happening.

The existing implementation of mb_strcut has completely different code for handling three types of text encoding: 1) constant byte width, 2) variable byte width, with mblen_table, and 3) variable byte width, without mblen_table.

UTF-8 is one of the text encodings with mblen_table. For such encodings, no conversion of erroneous byte sequences to ? is done. (My commit message is wrong on that; I need to adjust it.) The existing code steps through the string one char at a time, using the mblen_table to determine how many bytes to jump over for each character.

This means that the existing implementation can handle bytes like 0x80 in two different ways. If 0x80 appears as part of a multi-byte character, and you try cutting starting from 0x80, mb_strcut will back up to the beginning of the multi-byte character. However, if 0x80 appears outside a multi-byte character (which is invalid), mb_strcut will treat it as a character of its own and will not back up. (The reason why my fast implementation returns an empty string is because it backs up past the 0x80 byte when determining the end position of the cut.)

It is not possible to perfectly emulate the existing behavior (on invalid strings) without running through the entire string from the start. I could follow it a bit more closely, while still maintaining O(1) runtime, but there will always be corner cases where the behavior on invalid strings would differ.

Please mention if you have more thoughts about this.

The old implementation runs through the entire string to pick out the
part which should be returned by mb_strcut. This creates significant
performance overhead. The new specialized implementation of mb_strcut
for UTF-8 usually only examines a few bytes around the starting and
ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is
around 10% faster (according to microbenchmarks which I ran locally).
For strings around 10,000 bytes in length, it is 50-300x faster.
(Yes, that is 300x and not 300%.)

At the same time, I also added many more unit tests for mb_strcut.
This will help to avoid unintended behavior changes as the function
undergoes further performance work.

The new implementation behaves identically to the old one on VALID
UTF-8 strings; a fuzzer was used to help ensure this is the case.
On invalid UTF-8 strings, there is a difference: in some cases, the
old implementation will pass invalid byte sequences through unchanged,
while in others it will remove them. The new implementation has
behavior which is perhaps slightly more predictable: it simply backs
up the starting and ending cut points to the preceding "starter
byte" (one which is not a UTF-8 continuation byte).
@alexdowad
Copy link
Contributor Author

Just updated commit message and UPGRADING message to make them more accurate, based on what I learned from analyzing test cases provided by @youkidearitai.

@youkidearitai
Copy link
Contributor

I understand about UPGRADING.

By the way, personally, I think an ideal, if when appears invalid UTF-8 byte sequence convert to other byte (ex: ? or U+FFFD).

Copy link
Member

@iluuu1994 iluuu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my side!

@youkidearitai
Copy link
Contributor

By the way, personally, I think an ideal, if when appears invalid UTF-8 byte sequence convert to other byte (ex: ? or U+FFFD).

Regarding this matter, when I investigate about mb_strcut behavior (and gave me advise), this PR seems make sense. Sorry for confusion.

@alexdowad
Copy link
Contributor Author

I have split the new test cases into a separate commit as advised by @Girgias.

@alexdowad
Copy link
Contributor Author

Landed on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants