Add fast mb_strcut implementation for UTF-8 #12337

alexdowad · 2023-10-01T12:43:03Z

The old implementation decodes the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.)

At the same time, I also added many more unit tests for mb_strcut. This will help to avoid unintended behavior changes as the function undergoes further performance work.

The new implementation behaves identically to the old one on valid UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: the old implementation would convert invalid UTF-8 byte sequences to error markers ('?'), but the new implementation just cuts a subsequence of bytes out of the source string without performing any conversion on it.

Any comments will be much appreciated! @Girgias @cmb69 @youkidearitai @kamil-tekiela

alexdowad · 2023-10-01T12:46:14Z

Maybe @iluuu1994 might have some comment?

youkidearitai · 2023-10-01T14:44:17Z

I found different behavior. Is this expected?

$ sapi/cli/php -r 'var_dump(mb_strcut("あ\x80う", 3, 3));'
string(0) ""

But in the currently available versions: https://3v4l.org/edcf7

string(1) "�"

Girgias

Seems reasonable? One thing which might make it easier to check is to have the new test be in a prior commit with the current behaviour, and then have the feature as a follow-up.

Might want the opinions of @youkidearitai tho

ext/mbstring/libmbfl/filters/mbfilter_utf8.c

alexdowad · 2023-10-01T14:53:30Z

> I found different behavior. Is this expected?

Hi, @youkidearitai. Yes, this is expected: if you look again at the commit message, it mentions that the new code has identical results on valid UTF-8 strings, but that is an invalid UTF-8 string.

The explanation is exactly as written above in the description of this PR.

alexdowad · 2023-10-01T14:57:36Z

@youkidearitai BTW, I find it very impressive how you are (usually) able to discover interesting test cases for new code.

alexdowad · 2023-10-01T14:58:35Z

> Seems reasonable? One thing which might make it easier to check is to have the new test be in a prior commit with the current behaviour, and then have the feature as a follow-up.

Good idea, I can do that.

youkidearitai · 2023-10-01T15:18:17Z

@alexdowad Thank you for the reply.
My understanding is that invalid UTF-8 is removed. However, 0xF5 (which can never appear in valid UTF-8) is returned as is.

$ sapi/cli/php -r 'var_dump(bin2hex(mb_strcut("あ\xf5う", 3, 3)));'
string(2) "f5"

3v4l: https://3v4l.org/22jKX

Should bytes 0xF5 and larger be returned as is?

alexdowad · 2023-10-01T16:09:48Z

@youkidearitai Thanks... I have just run your test cases in a debugger to see what is happening.

The existing implementation of mb_strcut has completely different code for handling three types of text encoding: 1) constant byte width, 2) variable byte width, with mblen_table, and 3) variable byte width, without mblen_table.

UTF-8 is one of the text encodings with mblen_table. For such encodings, no conversion of erroneous byte sequences to ? is done. (My commit message is wrong on that; I need to adjust it.) The existing code steps through the string one char at a time, using the mblen_table to determine how many bytes to jump over for each character.

This means that the existing implementation can handle bytes like 0x80 in two different ways. If 0x80 appears as part of a multi-byte character, and you try cutting starting from 0x80, mb_strcut will back up to the beginning of the multi-byte character. However, if 0x80 appears outside a multi-byte character (which is invalid), mb_strcut will treat it as a character of its own and will not back up. (My fast implementation returns an empty string because it backs up past the 0x80 byte when determining the end position of the cut.)
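To make the table-driven walk described above concrete, here is a minimal sketch in C. It is not the libmbfl code: `sketch_mblen`, `walk_range`, and `walk_cut` are hypothetical names, and `sketch_mblen` is a simplified stand-in for the real `mblen_table` that assumes a stray continuation byte such as 0x80 counts as a 1-byte character, as described in this comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a UTF-8 mblen_table lookup (not the real table).
 * Assumption: ASCII and stray continuation bytes count as 1-byte characters. */
static size_t sketch_mblen(unsigned char b)
{
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

typedef struct { size_t start; size_t end; } walk_range;

/* Walk from the beginning of the string, one character at a time, to find
 * the byte range [start, end) that a cut at (from, len) would return. */
static walk_range walk_cut(const unsigned char *s, size_t n, size_t from, size_t len)
{
    walk_range r;
    size_t p = 0, m = 0;

    assert(from <= n);

    /* Step over whole characters until we reach or pass `from` */
    while (p < from) {
        m = sketch_mblen(s[p]);
        p += m;
    }
    /* If we overshot, `from` was inside a character: back up to its start */
    if (p > from) p -= m;
    r.start = p;

    if (len >= n - r.start) {
        r.end = n;
        return r;
    }

    /* Same walk for the end position */
    size_t q = p + len;
    while (p < q) {
        m = sketch_mblen(s[p]);
        p += m;
    }
    if (p > q) p -= m;
    r.end = p;
    return r;
}
```

With the bytes of "あ\x80う" (E3 81 82 80 E3 81 86), `walk_cut(s, 7, 3, 3)` yields the range [3, 4), which is the single 0x80 byte, consistent with the `string(1)` result reported for released versions. Note that the cost grows with `from`, since every character before the cut point is visited.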

It is not possible to perfectly emulate the existing behavior (on invalid strings) without running through the entire string from the start. I could follow it a bit more closely, while still maintaining O(1) runtime, but there will always be corner cases where the behavior on invalid strings would differ.

Please mention if you have more thoughts about this.

The old implementation runs through the entire string to pick out the part which should be returned by mb_strcut. This creates significant performance overhead. The new specialized implementation of mb_strcut for UTF-8 usually only examines a few bytes around the starting and ending cut points, meaning it generally runs in constant time.

For UTF-8 strings just a few bytes long, the new implementation is around 10% faster (according to microbenchmarks which I ran locally). For strings around 10,000 bytes in length, it is 50-300x faster. (Yes, that is 300x and not 300%.)

At the same time, I also added many more unit tests for mb_strcut. This will help to avoid unintended behavior changes as the function undergoes further performance work.

The new implementation behaves identically to the old one on VALID UTF-8 strings; a fuzzer was used to help ensure this is the case. On invalid UTF-8 strings, there is a difference: in some cases, the old implementation will pass invalid byte sequences through unchanged, while in others it will remove them. The new implementation has behavior which is perhaps slightly more predictable: it simply backs up the starting and ending cut points to the preceding "starter byte" (one which is not a UTF-8 continuation byte).
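The "back up to the preceding starter byte" rule from the commit message above can be sketched in a few lines of C. This is an illustration, not the code from this PR: `fast_range`, `is_continuation`, and `fast_cut_utf8` are hypothetical names, and the sketch assumes the end position is computed from the already adjusted start position.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t start; size_t end; } fast_range;

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Find the byte range [start, end) for a cut at (from, len) by looking only
 * at the bytes near the two cut points. */
static fast_range fast_cut_utf8(const unsigned char *s, size_t n, size_t from, size_t len)
{
    fast_range r;
    size_t start = from;

    assert(from <= n);

    /* Back up the start position to the preceding starter byte */
    while (start > 0 && start < n && is_continuation(s[start])) start--;
    r.start = start;

    if (len >= n - start) {
        r.end = n;
        return r;
    }

    /* Assumption: the end is measured from the adjusted start, then backed
     * up in the same way (but never before the start) */
    size_t end = start + len;
    while (end > start && is_continuation(s[end])) end--;
    r.end = end;
    return r;
}
```

Under these assumptions the sketch reproduces both results reported earlier in this thread: for "あ\x80う" with offset 3 and length 3 the range is empty, and for "あ\xf5う" it is [3, 4), the single 0xF5 byte. Each loop backs up only over a run of continuation bytes, which is at most three bytes in valid UTF-8, so the work does not depend on the string length for valid input.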

alexdowad · 2023-10-02T21:09:47Z

Just updated commit message and UPGRADING message to make them more accurate, based on what I learned from analyzing test cases provided by @youkidearitai.

youkidearitai · 2023-10-03T00:54:08Z

I understand about UPGRADING.

By the way, personally, I think the ideal would be to convert an invalid UTF-8 byte sequence, when one appears, to another character (e.g. ? or U+FFFD).

iluuu1994

Looks good from my side!

ext/mbstring/libmbfl/filters/mbfilter_htmlent.c

youkidearitai · 2023-10-03T09:16:49Z

> By the way, personally, I think the ideal would be to convert an invalid UTF-8 byte sequence, when one appears, to another character (e.g. ? or U+FFFD).

Regarding this matter: after investigating the behavior of mb_strcut (and receiving some advice), this PR seems to make sense. Sorry for the confusion.

alexdowad · 2023-10-04T07:22:51Z

I have split the new test cases into a separate commit as advised by @Girgias.

alexdowad · 2023-10-04T07:24:29Z

Landed on master.

github-actions bot added the Extension: mbstring label Oct 1, 2023

Girgias reviewed Oct 1, 2023

alexdowad force-pushed the cututf8 branch from 12cadd1 to b7b28b3 Compare October 2, 2023 21:09

iluuu1994 approved these changes Oct 3, 2023

alexdowad closed this Oct 4, 2023

alexdowad deleted the cututf8 branch October 4, 2023 07:24

youkidearitai mentioned this pull request Dec 19, 2023

mb_substr returns a different value in PHP 8.4 #12972

Closed
