-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Major overhaul of mbstring (part 2) #6402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I made some mistakes on this code, which meant that not everything which should be tested was actually being tested.
There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.
2fdc56e
to
414ce1a
Compare
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208. - Handle truncated multi-byte characters as an error. - Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the Unicode Consortium, and as iconv does. - Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the Unicode Consortium, and as iconv does. (NOTE: This will affect PHP scripts which use an internal encoding of Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example, it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen sign to the Unicode yen sign means the yen sign will not be usable for C escapes in double-quoted strings. Japanese PHP programmers who want to write their source code in Shift-JIS for some strange reason will have to use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C escapes.) - Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus). - Immediately handle error if first Shift-JIS byte is over 0xEF, rather than waiting to see the next byte. (Previously, the value used was 0xFC, which is the limit for the 2nd byte and not the 1st byte of a multi-byte character.) - Don't allow 'control characters' to appear in the middle of a multi-byte character. The test case for bug 47399 is now obsolete. That test assumed that a number of Shift-JIS byte sequences which don't map to any character were 'valid' (because the byte values were within the legal ranges).
- Don't allow control characters to appear in the middle of a multi-byte character. (A strange feature, or perhaps misfeature, of mbstring which is not present in other libraries such as iconv.) - When checking whether string is valid, reject kuten codes which do not map to any character, whether converting from EUC-JP to another encoding, or converting another encoding which uses JIS X 0208/0212 charsets to EUC-JP. - Truncated multi-byte characters are treated as an error.
mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same type, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.
414ce1a
to
1c4a8d1
Compare
Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..." |
cc @Girgias |
Indeed, seems a still messed up the parallel testing |
One thing which maybe should be called out here: In Shift-JIS text encoding, However, the Unicode codepoint for a yen sign is not Since the Zend engine uses mbstring for internationalization of PHP code, fixing this bug means that some PHP programs written in Shift-JIS-encoded text may not work as expected. (The yen signs will no longer be treated as 'backslashes' for things like C escapes in double-quoted strings.) Now, the JIS X 0208 character set does include a backslash character, and we do convert it to Unicode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
I think the only thing that's missing here is some more test coverage for whether the encoding detection works, because I'm not sure whether the current heuristic is sufficient. Simple thing to test would be take some small text samples in different languages, encode them in all encodings where they can be encoded, and then check whether the detected encoding comes back the same. Right now, we only seem to be testing a couple of Japanese encodings.
Definitely, exhaustive testing of For now, my priority is to put rigorous testing in place for |
Landed. |
I have been doing a lot of bug fixing and cleanup of mbstring over at #6052. However, I didn't anticipate how deep the mbstring rabbit hole is, and that PR has ballooned beyond all reasonable limits. Therefore, I am opening another one with some fixes and simplifications which can hopefully go in more quickly.
This PR aims to simplify mbstring by eliminating the use of
mbfl_identify_filter
(as discussed with @nikic elsewhere). A couple other things also had to be fixed for all the tests to pass after removingmbfl_identify_filter
...The removal of
byte2
/byte4
was not strictly necessary, but made the other work a bit easier.