
Major overhaul of mbstring (part 1) #6052


Closed
wants to merge 123 commits

Conversation

alexdowad (Contributor)

This PR aims to do the following to mbstring:

  • Slash reams of unneeded code
  • Untangle obscure code, make it easy to read
  • Comment all the remaining tricky parts
  • Make everything run dramatically faster
  • Fix bugs
  • Improve test coverage
  • Make slight adjustments to semantics so they are more logical and consistent

I was worried that some of the more tortured parts of mbstring might have been written that way for performance, but it turns out that wasn't the case. Almost all of these refactorings turned out to be good for performance.

To ensure that there were no performance regressions, I made a benchmark suite with 591 benchmarks covering different functions in mbstring, with different inputs (using different text encodings, long and short strings, and so on). It turned out that this suite did catch a couple of performance regressions, which were fixed.
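For a sense of what a single entry in such a suite looks like, here is a minimal sketch of one timing loop (a hypothetical harness and function/encoding choice, not the actual benchmark suite):

```php
<?php
// Hypothetical micro-benchmark: time one mbstring function on a long input.
// The function, encoding pair, and iteration count are illustrative only.
$input = str_repeat('テスト', 10000); // long UTF-8 string
$start = hrtime(true);
for ($i = 0; $i < 100; $i++) {
    mb_convert_encoding($input, 'SJIS', 'UTF-8');
}
printf("mb_convert_encoding UTF-8 => SJIS: %.3f ms/iteration\n",
       (hrtime(true) - $start) / 100 / 1e6);
```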

Lots of juicy, low-hanging fruit for performance optimization still remains, and I hope to harvest more of it. There are also issues remaining with incomplete user documentation, poor code comments, inadequate test coverage, and so on. But one must start somewhere!

I may also post a few pretty graphs here showing how perf measures up between current master and this goodness.

As always, hoping for some good code review...

@alexdowad (Contributor, Author)

Looks like LOC for mbstring is reduced by just about 2000 lines...

@alexdowad (Contributor, Author)

Just fixing a compilation error on Windows.

@alexdowad (Contributor, Author)

Trying to do a binary search to find the commit which broke the build on Windows...

@alexdowad force-pushed the cleanup-mbstring-1 branch 2 times, most recently from ed0ee5e to 0649055 on August 30, 2020 at 20:34
@alexdowad (Contributor, Author)

Whew! Looks like we are good on Windows now.

@nikic (Member) left a comment

Nice work!

I'm reviewing by commit, sorry if comments don't make sense after later changes. Stopped at 9530674 for now.

As a high-level comment: As the name suggests, libmbfl was originally an independent library, but hasn't been maintained in a long while. Around 7.3/7.4 we started heavily patching it, but still keeping it conceptually separate. This PR drops the concept of libmbfl as a separate lib and uses Zend APIs directly inside it. I think that at this point there is no disadvantage to doing that, so I'm okay with this direction.

@alexdowad (Contributor, Author)

Code review by @nikic is awesome, as usual! Will rebase and push corrected code soon.

Sorry that this PR is so huge. As a suggestion, if some low-risk, non-controversial commits can be merged first, that will reduce the size of the PR.

I'm not asking anyone else to actually merge them; I can do that, but just need the "thumbs up" to do so first.

@alexdowad (Contributor, Author)

All improvements kindly suggested by @nikic applied. Rebased.

@nikic (Member)

nikic commented Aug 31, 2020

Sorry that this PR is so huge. As a suggestion, if some low-risk, non-controversial commits can be merged first, that will reduce the size of the PR.

Feel free to already commit everything up to (and including) 493c644. In this instance I'd suggest to commit everything as-is (i.e. keep the history).

@alexdowad (Contributor, Author)

Sorry that this PR is so huge. As a suggestion, if some low-risk, non-controversial commits can be merged first, that will reduce the size of the PR.

Feel free to already commit everything up to (and including) 493c644. In this instance I'd suggest to commit everything as-is (i.e. keep the history).

Done. Thanks.

@Girgias (Member)

Girgias commented Sep 1, 2020

Currently has some memory leaks:

039+ [Mon Aug 31 21:30:07 2020]  Script:  '/home/travis/build/php/php-src/ext/mbstring/tests/mb_strcut.php'
040+ /home/travis/build/php/php-src/ext/mbstring/libmbfl/mbfl/mbfl_memory_device.c(42) :  Freeing 0x00007fe7e72030a0 (2 bytes), script=/home/travis/build/php/php-src/ext/mbstring/tests/mb_strcut.php
041+ Last leak repeated 6 times
042+ === Total 7 memory leaks detected ===

@alexdowad (Contributor, Author)

Currently has some memory leaks:

039+ [Mon Aug 31 21:30:07 2020]  Script:  '/home/travis/build/php/php-src/ext/mbstring/tests/mb_strcut.php'
040+ /home/travis/build/php/php-src/ext/mbstring/libmbfl/mbfl/mbfl_memory_device.c(42) :  Freeing 0x00007fe7e72030a0 (2 bytes), script=/home/travis/build/php/php-src/ext/mbstring/tests/mb_strcut.php
041+ Last leak repeated 6 times
042+ === Total 7 memory leaks detected ===

Thanks, @Girgias. No more memory leaks.

I added more commits onto this (already huge) PR...

@nikic (Member) left a comment

Only started on the mbfilter refactoring... it's huge.

@alexdowad (Contributor, Author)

alexdowad commented Sep 1, 2020

Incorporated all the suggestions from @nikic. Still thinking about the issue of whether there should be special-case logic for UTF-16 (as a performance optimization) or not.

I have realized there is one problem with the way I implemented strcut, and also with the general logic for counting backward from the end of a string. Investigating the best way to resolve this. However, code review can continue as per the availability of the hardworking maintainers.

I will not merge d5235cb until confident that this issue is resolved.

@alexdowad force-pushed the cleanup-mbstring-1 branch 5 times, most recently from 9b16c74 to c5c8d74 on September 2, 2020 at 07:02
@alexdowad (Contributor, Author)

Just pushing an update which addresses most of the points just raised by @nikic. Will follow up with another update later which will address the other ones.

…ants of Shift-JIS)

Lots of problems here.

- Don't pass 'control' characters through silently in the middle of a
  multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- Add identify filters for precise identification of valid strings in the
  DoCoMo, KDDI, and SoftBank variants of Shift-JIS.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an
  invalid ESC sequence is found, don't pass it through. Rather, handle it as
  an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence
  doesn't map to any emoji, don't emit a mangled value (actually a raw
  (ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall
  in the 2nd range of Microsoft vendor extensions.

Also do a major code cleanup -- remove dead code, rearrange what is left,
use some new macros and helper functions to make the code clearer...

Sigh. Double sigh. After fruitlessly searching the Internet for information on
this mysterious text encoding called "SJIS-open", I wrote a script to try
converting every Unicode codepoint from 0-0xFFFF and compare the results from
different variants of Shift-JIS, to see which one "SJIS-open" would be most
similar to.

The result? It's just CP932. There is no difference at all. So why do we have
two implementations of CP932 in mbstring?

In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or
"SJIS-ms"), add these as aliases to CP932 so existing code will continue to
work.
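A hedged illustration of that finding (the equivalence claim comes from the exhaustive conversion script described above; this snippet merely spot-checks one string):

```php
<?php
// After the aliasing, 'SJIS-win' and 'CP932' should be interchangeable.
$utf8 = 'テスト';
var_dump(mb_convert_encoding($utf8, 'SJIS-win', 'UTF-8')
     === mb_convert_encoding($utf8, 'CP932', 'UTF-8')); // bool(true)
```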
- Identify filter was wrong (didn't recognize KDDI emoji)
- Treat it as an error if a multi-byte character or escape sequence is truncated
- When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing
  hash characters or digits
- Don't allow 'control' characters to appear in the middle of a multi-byte char

Note: I was not able to find any kind of official or even semi-official
specification for this legacy encoding. Therefore, this test suite is based
largely on the behavior of the existing code.

Verifying the correctness of program code in this way is very questionable.
In a sense, all you are proving is that the code "does what it does". However,
the test suite will still expose any unintended _changes_ to behavior.

- Identify filter was wrong
- Treat it as error if multi-byte string or escape sequence is truncated
- Don't allow 'control' characters or escape sequences to appear in the middle
  of a multi-byte char

As with ISO-2022-JP-KDDI, the main reference used to develop these tests was
the behavior of the existing code. It would have been better to have some
independent reference which we could cross-check our code against, but I
couldn't find one.
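To make the error-handling changes in these commits concrete, a hedged PHP-level illustration (the exact output depends on the mb_substitute_character setting, and the behavior shown is the new one described above, not what the old code did):

```php
<?php
// A truncated multi-byte character is now treated as an error and respects
// mb_substitute_character, rather than being passed through or mangled.
mb_substitute_character(0x3F); // substitute with '?'
var_dump(mb_convert_encoding("\x88", 'UTF-8', 'SJIS')); // string(1) "?"
// "\x88" is the first byte of a two-byte Shift-JIS character, cut short.
```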
@alexdowad (Contributor, Author)

I have pushed the fixes for Shift-JIS and Shift-JIS-2004 to the beginning of the patch series. When possible, please review these. (SJIS-2004 is one of the encodings which currently has no identify filter.)

Interesting issue with ASCII. The identify filter for ASCII only accepts printable characters (plus carriage return, line feed, and horizontal tab) and NUL. However, other characters such as the 'bell' character, vertical tab, form feed, backspace, ESC, and so on are also valid ASCII.

However, when the identify filter is fixed, it means that 7-bit encodings like JIS7/8 and other ISO-2022 variants are never detected by mb_detect_encoding when mb_detect_order is 'auto'. Why? Well, ASCII comes earlier in the default detect order, and all valid strings in those 7-bit encodings are valid ASCII as well.

Any thoughts about that?
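A minimal illustration of the shadowing problem (the reported result is what the fixed, fully permissive ASCII identify filter would produce):

```php
<?php
// A JIS-encoded string is pure 7-bit, so once ASCII accepts ESC it matches
// ASCII too -- and ASCII comes first in the default detect order.
$jis = mb_convert_encoding('テスト', 'JIS', 'UTF-8'); // contains ESC sequences
var_dump(mb_detect_encoding($jis, ['ASCII', 'JIS'], true));
// Reports "ASCII", never "JIS".
```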

@nikic (Member)

nikic commented Nov 4, 2020

However, when the identify filter is fixed, it means that 7-bit encodings like JIS7/8 and other ISO-2022 variants are never detected by mb_detect_encoding when mb_detect_order is 'auto'. Why? Well, ASCII comes earlier in the default detect order, and all valid strings in those 7-bit encodings are valid ASCII as well.

Heh, I'm afraid we keep butchering things, because nobody knows how any of this is supposed to work :)

I think we need to rethink the meaning of identify filters. As already said before, the to_wchar filter is enough to validate the encoding, so let's just use that for that purpose. However, encoding detection is about more than just picking the single valid encoding. Especially for single byte encodings, the string may be valid under multiple encodings, but we can still pick out which one is more probable based on frequency analysis. See also our top-voted mbstring bug https://bugs.php.net/bug.php?id=38138 and the "detector" linked there: http://www.opennet.ru/base/dev/charset_autodetect.txt.html

I think the previous identify filter implementations (before your changes) kind of tried to do this, but in a very crude way. They basically only accepted "likely" characters, and considered any "unlikely" character to be directly invalid. Of course, this doesn't really make sense either.

To make this work, I think we should do the following:

  • During encoding detection, run both the to_wchar filter and the identify filter in parallel.
  • If to_wchar reports an error, discard the encoding immediately, it's illegal.
  • The identify filter increments a counter if the character is "likely" (or the other way around).
  • At the end, pick the encoding that has the most "likely" characters.

For the identify filter implementations, we'd probably have to bring back some of the previous identify filter code that got "fixed", as it probably isn't as broken as we thought when it comes to detecting likeliness...

From the technical side, the way I would do this is to chain the to_wchar filter and the identify filter: The identify filter will then be invoked once per character, with an already decoded character. However, for many single byte encodings it will be easier to check things on the encoded representation, so we should also provide a pointer to the start of the current character through a side channel. (All of this would be easier if to_wchar just returned the character, instead of chaining filters...)

Does that sound reasonable?
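In rough PHP pseudocode, the selection logic in the four bullet points above might look like this (the real implementation would be C inside mbstring; is_likely_char is a placeholder standing in for the identify filter, not a real API):

```php
<?php
function is_likely_char(string $char, string $enc): bool {
    return true; // placeholder for the identify filter's per-char judgment
}

function detect_encoding(string $s, array $candidates): ?string {
    $best = null;
    $bestScore = -1;
    foreach ($candidates as $enc) {
        if (!mb_check_encoding($s, $enc)) {
            continue; // to_wchar reported an error: discard immediately
        }
        $score = 0;
        foreach (mb_str_split($s, 1, $enc) as $char) {
            if (is_likely_char($char, $enc)) {
                $score++;
            }
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $enc;
        }
    }
    return $best; // encoding with the most "likely" characters wins
}
```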

@alexdowad (Contributor, Author)

However, when the identify filter is fixed, it means that 7-bit encodings like JIS7/8 and other ISO-2022 variants are never detected by mb_detect_encoding when mb_detect_order is 'auto'. Why? Well, ASCII comes earlier in the default detect order, and all valid strings in those 7-bit encodings are valid ASCII as well.

Heh, I'm afraid we keep butchering things, because nobody knows how any of this is supposed to work :)

I think we need to rethink the meaning of identify filters. As already said before, the to_wchar filter is enough to validate the encoding, so let's just use that for that purpose. However, encoding detection is about more than just picking the single valid encoding. Especially for single byte encodings, the string may be valid under multiple encodings, but we can still pick out which one is more probable based on frequency analysis. See also our top-voted mbstring bug https://bugs.php.net/bug.php?id=38138 and the "detector" linked there: http://www.opennet.ru/base/dev/charset_autodetect.txt.html

I think the previous identify filter implementations (before your changes) kind of tried to do this, but in a very crude way. They basically only accepted "likely" characters, and considered any "unlikely" character to be directly invalid. Of course, this doesn't really make sense either.

To make this work, I think we should do the following:

  • During encoding detection, run both the to_wchar filter and the identify filter in parallel.
  • If to_wchar reports an error, discard the encoding immediately, it's illegal.
  • The identify filter increments a counter if the character is "likely" (or the other way around).
  • At the end, pick the encoding that has the most "likely" characters.

For the identify filter implementations, we'd probably have to bring back some of the previous identify filter code that got "fixed", as it probably isn't as broken as we thought when it comes to detecting likeliness...

From the technical side, the way I would do this is to chain the to_wchar filter and the identify filter: The identify filter will then be invoked once per character, with an already decoded character. However, for many single byte encodings it will be easier to check things on the encoded representation, so we should also provide a pointer to the start of the current character through a side channel. (All of this would be easier if to_wchar just returned the character, instead of chaining filters...)

Does that sound reasonable?

Hi, @nikic. Lots of good points here. Here's what I am thinking:

  • The conversion and identify filters are definitely redundant. We can do away with identify filters.
  • Right now, each identify filter struct has a flag field which basically means "doesn't match this encoding". We can add the same flag field to the conversion filter structs, to be set whenever the input string is found to be invalid. This covers cases where an input string is invalid, but in a way such that it does not make sense to increment illegal_chars. (For example, a required terminator is missing.)
  • I don't see why the job of judging whether a certain sequence of characters is "likely" or not should be specific to each encoding. It should be possible to write a generic "identify" function which works on wchars. If the generic identify filter sees control characters like 'end of transmission', 'shift in', 'shift out', 'bell', etc. that will reduce the estimated probability of the encoding under test being correct.
  • So the generic identify function will get its input from the conversion filter.
  • Yes, if the input string is valid in one encoding but invalid in another, the valid one will always be preferred over the invalid one, no matter how many "unlikely" codepoints are found in the valid one.
  • Subjectively, there are a lot of other things which might indicate that an encoding is the wrong one. For example, if you have a mix of Japanese, Cyrillic, and Greek letters in the same line, the guessed encoding is probably wrong. Some Chinese characters are extremely, extremely rare and might also be taken as "unlikely", but identifying which ones they are could be a big research project. I think we just want a couple of simple, robust heuristics which can be evaluated quickly.

Hope to hear more comments from you.
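As a rough sketch of that generic, encoding-independent scoring (weights are illustrative only, and this assumes ext-intl for IntlChar):

```php
<?php
// Score decoded codepoints (wchars): penalize rare control characters such
// as EOT, BEL, SO, SI; reward letters and digits.
function likelihood_score(array $codepoints): int {
    $score = 0;
    foreach ($codepoints as $cp) {
        if ($cp < 0x20 && !in_array($cp, [0x09, 0x0A, 0x0D], true)) {
            $score -= 10; // control chars other than tab, LF, CR
        } elseif (IntlChar::isalnum($cp)) {
            $score += 2;
        }
    }
    return $score;
}
```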

@nikic (Member)

nikic commented Nov 4, 2020

Right now, each identify filter struct has a flag field which basically means "doesn't match this encoding". We can add the same flag field to the conversion filter structs, to be set whenever the input string is found to be invalid. This covers cases where an input string is invalid, but in a way such that it does not make sense to increment illegal_chars. (For example, a required terminator is missing.)

Sounds reasonable to me. (Maybe make it is_invalid rather than flag ^^)

I don't see why the job of judging whether a certain sequence of characters is "likely" or not should be specific to each encoding. It should be possible to write a generic "identify" function which works on wchars. If the generic identify filter sees control characters like 'end of transmission', 'shift in', 'shift out', 'bell', etc. that will reduce the estimated probability of the encoding under test being correct.

That's an interesting idea. I think doing this generically could indeed work. Following http://www.opennet.ru/base/dev/charset_autodetect.txt.html we could do something like count the number of characters with L unicode character property and ignore other characters.

I expect there's a lot of prior art in this area -- it would be great to find some existing simple identification scheme that we could adopt. (Don't think we'll want to implement something overly sophisticated here.)
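A one-function sketch of that idea (assuming the candidate decoding has already been converted to UTF-8 for inspection):

```php
<?php
// Count characters with the Unicode L (letter) property, ignore the rest.
function letter_ratio(string $utf8): float {
    $total = mb_strlen($utf8, 'UTF-8');
    return $total === 0 ? 0.0 : preg_match_all('/\p{L}/u', $utf8) / $total;
}
```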

@alexdowad (Contributor, Author)

That's an interesting idea. I think doing this generically could indeed work. Following http://www.opennet.ru/base/dev/charset_autodetect.txt.html we could do something like count the number of characters with L unicode character property and ignore other characters.

Is the idea that lowercase characters are more common than uppercase?

@alexdowad (Contributor, Author)

Just Googling for prior art. It seems like the most common approach is:

  • For character sets with a large number of characters, identify some set of 'common' characters and see how often these appear in the input string (when decoded using the encoding under test).
  • For character sets with a small number of characters, identify a set of common pairs and see how often those appear in the input string.

I like this. It would easily identify JIS strings as not being ASCII. For example, here is a JIS string: "F|K\8l%F%-%9%H$G$9!#1234#5#6#7#8#9!#". You can see that none of the individual characters in the string are 'rare', but there are no common consecutive pairs.
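For illustration, a toy digram counter of the kind described above (the digram list is a made-up sample, not a real frequency table):

```php
<?php
// Score text by counting occurrences of common English letter pairs.
function digram_score(string $s): int {
    static $common = ['th', 'he', 'in', 'er', 'an', 're', 'on', 'at'];
    $score = 0;
    for ($i = 0, $n = strlen($s) - 1; $i < $n; $i++) {
        if (in_array(strtolower(substr($s, $i, 2)), $common, true)) {
            $score++;
        }
    }
    return $score;
}
// The JIS string above scores near zero; typical English text does not.
```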

@alexdowad (Contributor, Author)

@nikic, if we need to store a 'database' of common characters for each character set, how much space would it be reasonable to use per text encoding? Is 64K per encoding reasonable? Or should it be smaller?

@nikic (Member)

nikic commented Nov 4, 2020

I'd prefer avoiding per-encoding tables if possible. I liked your original idea of making the detection generic based on Unicode code points, and I think that would still work for your example. Even if we count both numbers and letters towards the probable characters, the ratio for your string will presumably be worse for ASCII than for JIS.

Encoding detection is a big rabbit hole, and I think we should solve it in the simplest and cheapest way we can that still produces somewhat adequate results.

@alexdowad (Contributor, Author)

Encoding detection is a big rabbit hole, and I think we should solve it in the simplest and cheapest way we can that still produces somewhat adequate results.

Fair enough. So digits and letters (including letters in non-Latin alphabets) have the highest weight, whitespace lower, punctuation lower still, and control/unprintable characters the lowest.

Do we bother weighting lowercase and uppercase letters differently?

@alexdowad (Contributor, Author)

Just working on getting rid of identify filters.

What would you think of axing mb_convert_encoding($string, 'HTML-ENTITIES', $source_encoding)? PHP has at least 3 different implementations of decoding HTML entities (2 in mbstring, 1 standard library function html_entity_decode).

If we really want to avoid a BC break, we could catch 'HTML-ENTITIES' in mb_convert_encoding and call htmlentities().

I'm asking because the html entity filter is troublesome for what I'm trying to do right now.
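A hedged sketch of what such a shim could look like (not actual php-src code; note that htmlentities() is not an exact match for mbstring's HTML-ENTITIES, which emits numeric entities for every non-ASCII codepoint, so a real shim would need more care):

```php
<?php
function mb_convert_encoding_shim(string $s, string $to, string $from): string {
    if (strcasecmp($to, 'HTML-ENTITIES') === 0) {
        $utf8 = mb_convert_encoding($s, 'UTF-8', $from);
        return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
    }
    if (strcasecmp($from, 'HTML-ENTITIES') === 0) {
        $utf8 = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
        return mb_convert_encoding($utf8, $to, 'UTF-8');
    }
    return mb_convert_encoding($s, $to, $from);
}
```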

@nikic (Member)

nikic commented Nov 9, 2020

What would you think of axing mb_convert_encoding($string, 'HTML-ENTITIES', $source_encoding)? PHP has at least 3 different implementations of decoding HTML entities (2 in mbstring, 1 standard library function html_entity_decode).

If we really want to avoid a BC break, we could catch 'HTML-ENTITIES' in mb_convert_encoding and call htmlentities().

I'm asking because the html entity filter is troublesome for what I'm trying to do right now.

In what context is it causing issues?

@alexdowad (Contributor, Author)

What would you think of axing mb_convert_encoding($string, 'HTML-ENTITIES', $source_encoding)? PHP has at least 3 different implementations of decoding HTML entities (2 in mbstring, 1 standard library function html_entity_decode).

In what context is it causing issues?

I already overcame the problem in another way. So I'm not worried about this one for now. Thanks for the response.

@mvorisek (Contributor)

Other parts of this PR have been merged; should this be closed?

@nikic (Member)

nikic commented Mar 19, 2021

@mvorisek Most of this PR isn't merged yet. I believe the fixes for the encoding implementations have landed, but not the main changes to mbfl.

@Girgias (Member)

Girgias commented Jun 20, 2022

@alexdowad Have you split this PR into the various other ones you've made over the years, or is there still something relevant in this PR?

@alexdowad (Contributor, Author)

@alexdowad Have you split this PR into the various other ones you've made over the years, or is there still something relevant in this PR?

@Girgias It's good to see that you are trying to clean up the GH issues.

There is still stuff in this PR which I may fish out, rework, and merge at some point... but I can still do that if you close it. So either way is fine.

@Girgias (Member)

Girgias commented Jan 25, 2023

Going to close this, as I trust @alexdowad to figure out what's still relevant here amid all the conflicts.

@Girgias closed this Jan 25, 2023