Major overhaul of mbstring (part 1) #6052
Conversation
Looks like LOC for mbstring is reduced by just about 2000 lines...
Force-pushed from c5f26f4 to ea08ef8
Just fixing a compilation error on Windows.
Force-pushed from c7127e7 to ccd380d
Trying to do a binary search to find the commit which broke the build on Windows...
Force-pushed from ed0ee5e to 0649055
Whew! Looks like we are good on Windows now.
Nice work!
I'm reviewing by commit, sorry if comments don't make sense after later changes. Stopped at 9530674 for now.
As a high-level comment: As the name suggests, libmbfl was originally an independent library, but hasn't been maintained in a long while. Around 7.3/7.4 we started heavily patching it, but still keeping it conceptually separate. This PR drops the concept of libmbfl as a separate lib and uses Zend APIs directly inside it. I think that at this point there is no disadvantage to doing that, so I'm okay with this direction.
Code review by @nikic is awesome, as usual! Will rebase and push corrected code soon. Sorry that this PR is so huge. As a suggestion, if some low-risk, non-controversial commits can be merged first, that will reduce the size of the PR. I'm not asking anyone else to actually merge them; I can do that, but just need the "thumbs up" to do so first.
Force-pushed from 0649055 to 995d228
All improvements kindly suggested by @nikic applied. Rebased.
Feel free to already commit everything up to (and including) 493c644. In this instance I'd suggest committing everything as-is (i.e. keep the history).
Force-pushed from 995d228 to 1dff0f0
Done. Thanks.
Currently has some memory leaks:
Force-pushed from 1dff0f0 to 321a9c1
Thanks, @Girgias. No more memory leaks. I added more commits onto this (already huge) PR...
Only started on the mbfilter refactoring... it's huge.
Incorporated all the suggestions from @nikic. Still thinking about the issue of whether there should be special-case logic for UTF-16 (as a performance optimization) or not. I have realized there is one problem with the way I implemented it; I will not merge d5235cb until I'm confident that this issue is resolved.
Force-pushed from 9b16c74 to c5c8d74
Force-pushed from c5c8d74 to 50fd46f
Just pushing an update which addresses most of the points just raised by @nikic. Will follow up with another update later which will address the other ones.
…ants of Shift-JIS)

Lots of problems here.

- Don't pass 'control' characters through silently in the middle of a multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- Add identify filters for precise identification of valid strings in the DoCoMo, KDDI, and SoftBank variants of Shift-JIS.
- For ESC sequences used to encode emoji on earlier SoftBank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of Microsoft vendor extensions.

Also do a major code cleanup: remove dead code, rearrange what is left, and use some new macros and helper functions to make the code clearer.
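To make the error-handling change concrete, here is a minimal PHP-level sketch of the behavior described above; the byte values and the expected handling are illustrative assumptions, not output taken from this PR.

```php
<?php
// A Shift-JIS (DoCoMo variant) string ending in a truncated multi-byte
// character: 0x82 0xA0 is a complete character, but the trailing 0xF8 is
// only the first byte of a two-byte sequence.
$broken = "\x82\xA0\xF8";

// Ask mbstring to substitute U+FFFD for anything it cannot decode.
mb_substitute_character(0xFFFD);

// With the fixes above, the truncated character should be treated as an
// error and replaced, rather than being silently dropped or passed through.
$utf8 = mb_convert_encoding($broken, 'UTF-8', 'SJIS-Mobile#DOCOMO');
var_dump(bin2hex($utf8));
```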
Sigh. Double sigh. After fruitlessly searching the Internet for information on this mysterious text encoding called "SJIS-open", I wrote a script to try converting every Unicode codepoint from 0-0xFFFF and compare the results from different variants of Shift-JIS, to see which one "SJIS-open" would be most similar to. The result? It's just CP932. There is no difference at all. So why do we have two implementations of CP932 in mbstring? In case somebody, somewhere is using "SJIS-open" (or its aliases "SJIS-win" or "SJIS-ms"), add these as aliases to CP932 so existing code will continue to work.
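The comparison described above can be sketched in a few lines of PHP; this is a reconstruction of the idea, not the actual script that was used.

```php
<?php
// Convert every BMP codepoint through 'SJIS-win' and 'CP932' and count any
// codepoints that come out differently. If the count is zero, the two
// encodings are identical for conversion purposes.
$differences = 0;
for ($cp = 0; $cp <= 0xFFFF; $cp++) {
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        continue; // skip surrogates, which are not valid codepoints
    }
    $utf8 = mb_chr($cp, 'UTF-8');
    $a = mb_convert_encoding($utf8, 'SJIS-win', 'UTF-8');
    $b = mb_convert_encoding($utf8, 'CP932', 'UTF-8');
    if ($a !== $b) {
        $differences++;
    }
}
echo "Codepoints converted differently: $differences\n";
```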
- Identify filter was wrong (didn't recognize KDDI emoji)
- Treat it as an error if a multi-byte character or escape sequence is truncated
- When converting other encodings to ISO-2022-JP-KDDI, don't swallow trailing hash characters or digits
- Don't allow 'control' characters to appear in the middle of a multi-byte char
Note: I was not able to find any kind of official or even semi-official specification for this legacy encoding. Therefore, this test suite is based largely on the behavior of the existing code. Verifying the correctness of program code in this way is very questionable. In a sense, all you are proving is that the code "does what it does". However, the test suite will still expose any unintended _changes_ to behavior.
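As a rough illustration of this "lock in current behavior" approach, a snapshot-style check could look like the sketch below. The corpus file, snapshot file, and encoding name are assumptions made for the example; the real tests in php-src are .phpt files with fixed expected output.

```php
<?php
// Record the conversion output for a corpus of sample strings once, then
// fail if the output ever changes in a later run.
$corpus   = file('iso2022jp_kddi_samples.txt', FILE_IGNORE_NEW_LINES); // hypothetical corpus
$snapshot = 'iso2022jp_kddi_expected.txt';                             // hypothetical snapshot

$actual = '';
foreach ($corpus as $line) {
    // Encoding name assumed; adjust to the name registered in your mbstring build.
    $actual .= bin2hex(mb_convert_encoding($line, 'UTF-8', 'ISO-2022-JP-MOBILE#KDDI')) . "\n";
}

if (!file_exists($snapshot)) {
    file_put_contents($snapshot, $actual);          // first run: record current behavior
} elseif ($actual !== file_get_contents($snapshot)) {
    echo "Unintended change in ISO-2022-JP-KDDI conversion behavior!\n";
}
```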
- Identify filter was wrong
- Treat it as error if multi-byte string or escape sequence is truncated
- Don't allow 'control' characters or escape sequences to appear in the middle of a multi-byte char
As with ISO-2022-JP-KDDI, the main reference used to develop these tests was the behavior of the existing code. It would have been better to have some independent reference which we could cross-check our code against, but I couldn't find one.
Force-pushed from 421f0b2 to 25ac652
I have pushed the fixes for Shift-JIS and Shift-JIS-2004 to the beginning of the patch series. When possible, please review these. (SJIS-2004 is one of the encodings which currently has no identify filter.)

Interesting issue with ASCII. The identify filter for ASCII only accepts printable characters (plus carriage return, line feed, horizontal tab, and NUL). However, other characters such as the 'bell' character, vertical tab, form feed, backspace, ESC, and so on are also valid ASCII. But when the identify filter is fixed, it means that 7-bit encodings like JIS7/8 and other ISO-2022 variants are never detected. Any thoughts about that?
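A small PHP example of the detection problem being described; the exact result depends on the identify-filter behavior under discussion, so treat this as an illustration of the issue rather than a statement of current output.

```php
<?php
// An ISO-2022-JP ("JIS") string: every byte is 7-bit, and the escape
// sequences that switch character sets begin with ESC (0x1B).
$jis = mb_convert_encoding("日本語テキスト", 'JIS', 'UTF-8');

// If the ASCII identify filter accepts ESC (a perfectly valid ASCII byte),
// then a candidate list that includes ASCII can never report JIS here,
// because the whole string also "looks like" ASCII.
var_dump(mb_detect_encoding($jis, ['ASCII', 'JIS'], true));
```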
Heh, I'm afraid we keep butchering things, because nobody knows how any of this is supposed to work :)

I think we need to rethink the meaning of identify filters. As already said before, the to_wchar filter is enough to validate the encoding, so let's just use that for that purpose. However, encoding detection is about more than just picking the single valid encoding. Especially for single-byte encodings, the string may be valid under multiple encodings, but we can still pick out which one is more probable based on frequency analysis. See also our top-voted mbstring bug https://bugs.php.net/bug.php?id=38138 and the "detector" linked there: http://www.opennet.ru/base/dev/charset_autodetect.txt.html

I think the previous identify filter implementations (before your changes) kind of tried to do this, but in a very crude way. They basically only accepted "likely" characters, and considered any "unlikely" character to be directly invalid. Of course, this doesn't really make sense either.

To make this work, I think we should do the following:
For the identify filter implementations, we'd probably have to bring back some of the previous identify filter code that got "fixed", as it probably isn't as broken as we thought when it comes to detecting likeliness...

From the technical side, the way I would do this is to chain the to_wchar filter and the identify filter: the identify filter will then be invoked once per character, with an already decoded character. However, for many single-byte encodings it will be easier to check things on the encoded representation, so we should also provide a pointer to the start of the current character through a side channel. (All of this would be easier if to_wchar just returned the character, instead of chaining filters...)

Does that sound reasonable?
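The distinction drawn here between validation and detection can also be seen from userland; a small PHP illustration (the ISO-8859-1 example is chosen only to show that one byte string can be valid under several encodings at once):

```php
<?php
$bytes = "\xC3\xA9tude";  // "étude" encoded as UTF-8

// Validation: is this byte string well-formed under one given encoding?
var_dump(mb_check_encoding($bytes, 'UTF-8'));   // true
var_dump(mb_check_encoding($bytes, 'ASCII'));   // false (bytes above 0x7F)

// Detection: the same bytes are also valid ISO-8859-1, so a detector cannot
// rely on validity alone; it has to rank the candidates by likelihood.
var_dump(mb_detect_encoding($bytes, ['UTF-8', 'ISO-8859-1'], true));
```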
Hi, @nikic. Lots of good points here. Here's what I am thinking:
Hope to hear more comments from you.
Sounds reasonable to me. (Maybe make it
That's an interesting idea. I think doing this generically could indeed work. Following http://www.opennet.ru/base/dev/charset_autodetect.txt.html we could do something like count the number of characters with the L Unicode character property and ignore other characters. I expect there's a lot of prior art in this area -- it would be great to find some existing simple identification scheme that we could adopt. (Don't think we'll want to implement something overly sophisticated here.)
Is the idea that lowercase characters are more common than uppercase?
Just Googling for prior art. It seems like the most common approach is:
I like this. It would easily identify JIS strings as not being ASCII. For example, here is a JIS string: "F|K\8l%F%-%9%H$G$9!#1234#5#6#7#8#9!#". You can see that none of the individual characters in the string are 'rare', but there are no common consecutive pairs.
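This point can be checked mechanically with a tiny PHP snippet (illustrative only):

```php
<?php
// The JIS-encoded body quoted above: every byte is an ordinary printable
// ASCII character, so no per-character "rarity" test will flag it.
$body = 'F|K\8l%F%-%9%H$G$9!#1234#5#6#7#8#9!#';
var_dump(ctype_print($body));  // true: nothing here is individually 'rare'

// What gives it away is the distribution: the *pairs* of characters are
// unlike anything that occurs in normal ASCII text.
```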
@nikic, if we need to store a 'database' of common characters for each character set, how much space would it be reasonable to use per text encoding? Is 64K per encoding reasonable? Or make it smaller?
I'd prefer avoiding per-encoding tables if possible. I liked your original idea of making the detection generic based on Unicode code points, and think that would still work for your example. Even if we count both numbers and letters towards the probable characters, the ratio for your string will be worse for ASCII than for JIS, presumably. Encoding detection is a big rabbit hole, and I think we should solve it in the simplest and cheapest way we can that still produces somewhat adequate results.
Fair enough. So digits and letters (including letters in non-Latin alphabets) have the highest weight, whitespace lower, punctuation lower still, and control/unprintable characters the lowest. Do we bother weighting lowercase and uppercase letters differently?
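To make the weighting idea concrete, here is a hypothetical scoring function along these lines. The weights, the function name, and the choice to score in userland PHP are all invented for illustration; the real implementation would live in C inside mbstring.

```php
<?php
// Score how plausible $bytes is under $encoding: decode it, then weight each
// character by its Unicode category (letters/digits high, controls low).
function score_candidate(string $bytes, string $encoding): float
{
    if (!mb_check_encoding($bytes, $encoding)) {
        return -INF;  // not even valid: rule it out entirely
    }
    $utf8  = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    $score = 0.0;
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $ch) {
        if (preg_match('/^[\p{L}\p{N}]$/u', $ch)) {
            $score += 1.0;   // letters and digits: most probable
        } elseif (preg_match('/^(\p{Zs}|[\r\n\t])$/u', $ch)) {
            $score += 0.5;   // whitespace
        } elseif (preg_match('/^\p{P}$/u', $ch)) {
            $score += 0.25;  // punctuation
        } else {
            $score -= 1.0;   // control characters and other unlikely input
        }
    }
    return $score / max(1, mb_strlen($utf8, 'UTF-8'));
}
```

The encoding with the highest average score among the valid candidates would then be reported as the detected encoding.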
Just working on getting rid of identify filters. What would you think of axing 'HTML-ENTITIES'? If we really want to avoid a BC break, we could catch 'HTML-ENTITIES' as a special case. I'm asking because the HTML entity filter is troublesome for what I'm trying to do right now.
In what context is it causing issues?
I already overcame the problem in another way. So I'm not worried about this one for now. Thanks for the response.
Other PR parts are merged; should this be closed?
@mvorisek Most of this PR isn't merged yet. I believe the fixes for the encoding implementations have landed, but not the main changes to mbfl.
@alexdowad Have you split this PR into the various other ones you've made over the year, or is there still something relevant in this PR?
@Girgias It's good to see that you are trying to clean up the GH issues. There is still stuff in this PR which I may fish out, rework, and merge at some point... but I can still do that if you close it. So either way is fine.
Going to close this, as I trust @alexdowad to figure out what's still relevant from this out of all the conflicts.
This PR aims to do the following to mbstring:
I was worried that some of the more tortured parts of mbstring might have been written that way for performance, but it turns out that wasn't the case. Almost all of these refactorings turned out to be good for performance.
To ensure that there were no performance regressions, I made a benchmark suite with 591 benchmarks covering different functions in mbstring, with different inputs (using different text encodings, long and short strings, and so on). It turned out that this suite did catch a couple of performance regressions, which were fixed.
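For a sense of what such a benchmark looks like, here is a stripped-down sketch; the inputs, iteration count, and the single conversion being timed are made up, and the real suite covers far more functions and encodings.

```php
<?php
// Time one mbstring operation on a short and a long input.
$short = str_repeat('日本語テキスト', 2);
$long  = str_repeat('日本語テキストです。', 10000);

foreach (['short' => $short, 'long' => $long] as $label => $input) {
    $sjis  = mb_convert_encoding($input, 'SJIS', 'UTF-8');
    $start = hrtime(true);
    for ($i = 0; $i < 1000; $i++) {
        mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
    }
    $ms = (hrtime(true) - $start) / 1e6;
    printf("SJIS->UTF-8, %s input: %.2f ms for 1000 conversions\n", $label, $ms);
}
```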
Lots of juicy, low-hanging fruit for performance optimization still remains, and I hope to harvest more of it. There are also issues remaining with incomplete user documentation, poor code comments, inadequate test coverage, and so on. But one must start somewhere!
I may also post a few pretty graphs here showing how perf measures up between current `master` and this goodness. As always, hoping for some good code review...