Major overhaul of mbstring (part 2) #6402

alexdowad · 2020-11-05T22:04:17Z

I have been doing a lot of bug fixing and cleanup of mbstring over at #6052. However, I didn't anticipate how deep the mbstring rabbit hole is, and that PR has ballooned beyond all reasonable limits. Therefore, I am opening another one with some fixes and simplifications which can hopefully go in more quickly.

This PR aims to simplify mbstring by eliminating the use of mbfl_identify_filter (as discussed with @nikic elsewhere). A couple other things also had to be fixed for all the tests to pass after removing mbfl_identify_filter...

The removal of byte2/byte4 was not strictly necessary, but made the other work a bit easier.

I made some mistakes on this code, which meant that not everything which should be tested was actually being tested.

There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.

- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208. - Handle truncated multi-byte characters as an error. - Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the Unicode Consortium, and as iconv does. - Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the Unicode Consortium, and as iconv does. (NOTE: This will affect PHP scripts which use an internal encoding of Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example, it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen sign to the Unicode yen sign means the yen sign will not be usable for C escapes in double-quoted strings. Japanese PHP programmers who want to write their source code in Shift-JIS for some strange reason will have to use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C escapes.) - Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus). - Immediately handle error if first Shift-JIS byte is over 0xEF, rather than waiting to see the next byte. (Previously, the value used was 0xFC, which is the limit for the 2nd byte and not the 1st byte of a multi-byte character.) - Don't allow 'control characters' to appear in the middle of a multi-byte character. The test case for bug 47399 is now obsolete. That test assumed that a number of Shift-JIS byte sequences which don't map to any character were 'valid' (because the byte values were within the legal ranges).

- Don't allow control characters to appear in the middle of a multi-byte character. (A strange feature, or perhaps misfeature, of mbstring which is not present in other libraries such as iconv.) - When checking whether string is valid, reject kuten codes which do not map to any character, whether converting from EUC-JP to another encoding, or converting another encoding which uses JIS X 0208/0212 charsets to EUC-JP. - Truncated multi-byte characters are treated as an error.

mbstring had an 'identify filter' for almost every supported text encoding which was used when auto-detecting the most likely encoding for a string. It would run over the string and set a 'flag' if it saw anything which did not appear likely to be the encoding in question. One problem with this scheme was that encodings which merely appeared less likely to be the correct one were completely rejected, even if there was no better candidate. Another problem was that the 'identify filters' had a huge amount of code duplication with the 'conversion filters'. Eliminate the identify filters. Instead, when auto-detecting text encoding, use conversion filters to see whether the input string is valid in candidate encodings or not. At the same type, watch the type of codepoints which the string decodes to and mark it as less likely if non-printable characters (ESC, form feed, bell, etc.) or 'private use area' codepoints are seen. Interestingly, one old test case in which JIS text was misidentified as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed' and the JIS string is now auto-detected as JIS.

alexdowad · 2020-11-06T07:28:55Z

Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..."

KalleZ · 2020-11-06T09:12:11Z

Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..."

cc @Girgias

Girgias · 2020-11-06T12:51:11Z

Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..."

Indeed, seems a still messed up the parallel testing

alexdowad · 2020-11-06T13:26:58Z

One thing which maybe should be called out here:

In Shift-JIS text encoding, 0x5C is used for a yen sign (not a backslash as in ASCII). The same is true of some other legacy Japanese text encodings. Because 0x5C, the ASCII backslash, has special meaning in some contexts (such as some command shells or the C programming language), Japanese programmers may use yen signs in those same contexts.

However, the Unicode codepoint for a yen sign is not 0x5C but rather 0xA5. To correctly convert Shift-JIS to Unicode, SJIS 0x5C should be converted to Unicode 0xA5, rather than Unicode 0x5C (as mbstring currently does).

Since the Zend engine uses mbstring for internationalization of PHP code, fixing this bug means that some PHP programs written in Shift-JIS-encoded text may not work as expected. (The yen signs will no longer be treated as 'backslashes' for things like C escapes in double-quoted strings.)

Now, the JIS X 0208 character set does include a backslash character, and we do convert it to Unicode 0x5C, so that can still be used. So Japanese programmers using legacy text encodings for PHP coding are not left out in the cold; but it might be a surprise if someone finds an old PHP script doesn't work the same as it used to.

nikic

Looks great!

I think the only thing that's missing here is some more test coverage for whether the encoding detection works, because I'm not sure whether the current heuristic is sufficient. Simple thing to test would be take some small text samples in different languages, encode them in all encodings where they can be encoded, and then check whether the detected encoding comes back the same. Right now, we only seem to be testing a couple of Japanese encodings.

ext/mbstring/tests/sjis_encoding.phpt

alexdowad · 2020-11-09T11:43:15Z

Looks great!

I think the only thing that's missing here is some more test coverage for whether the encoding detection works, because I'm not sure whether the current heuristic is sufficient. Simple thing to test would be take some small text samples in different languages, encode them in all encodings where they can be encoded, and then check whether the detected encoding comes back the same. Right now, we only seem to be testing a couple of Japanese encodings.

Definitely, exhaustive testing of mb_detect_encoding is coming. Actually, not just mb_detect_encoding but all API functions exposed to PHP code will be 'torture tested' and any bugs discovered will be fixed. I fully expect that there will be more work on encoding detection at that time; for now, the fact that all existing tests pass shows that the new implementation works at least as well as the old one did.

For now, my priority is to put rigorous testing in place for mb_check_encoding and mb_convert_encoding for all supported text encodings.

alexdowad · 2020-11-09T11:48:54Z

Landed.

alexdowad added 2 commits November 5, 2020 12:57

Fix issues with mbstring encoding tests

856ea31

I made some mistakes on this code, which meant that not everything which should be tested was actually being tested.

Remove useless byte{2,4}{be,le} encodings from mbstring

a03dafd

There is no meaningful difference between these and UCS-{2,4}. They are just a little bit more lax about passing errors silently. They also have no known use. Alias to UCS-{2,4} in case someone, somewhere is using them.

alexdowad force-pushed the cleanup-mbstring-2 branch from 2fdc56e to 414ce1a Compare November 6, 2020 05:56

alexdowad added 4 commits November 6, 2020 08:17

Treat non-ASCII characters as erroneous when converting ASCII text

ab37929

alexdowad force-pushed the cleanup-mbstring-2 branch from 414ce1a to 1c4a8d1 Compare November 6, 2020 06:18

nikic approved these changes Nov 9, 2020

View reviewed changes

ext/mbstring/tests/sjis_encoding.phpt Show resolved Hide resolved

ext/mbstring/tests/sjis_encoding.phpt Show resolved Hide resolved

alexdowad closed this Nov 9, 2020

alexdowad mentioned this pull request Nov 9, 2020

Major overhaul of mbstring (part 3) #6416

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Major overhaul of mbstring (part 2) #6402

Major overhaul of mbstring (part 2) #6402

Uh oh!

alexdowad commented Nov 5, 2020

Uh oh!

alexdowad commented Nov 6, 2020

Uh oh!

KalleZ commented Nov 6, 2020

Uh oh!

Girgias commented Nov 6, 2020

Uh oh!

alexdowad commented Nov 6, 2020 •

edited

Loading

Uh oh!

nikic left a comment

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Nov 9, 2020

Uh oh!

alexdowad commented Nov 9, 2020

Uh oh!

Uh oh!

Major overhaul of mbstring (part 2) #6402

Major overhaul of mbstring (part 2) #6402

Uh oh!

Conversation

alexdowad commented Nov 5, 2020

Uh oh!

alexdowad commented Nov 6, 2020

Uh oh!

KalleZ commented Nov 6, 2020

Uh oh!

Girgias commented Nov 6, 2020

Uh oh!

alexdowad commented Nov 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Nov 9, 2020

Uh oh!

alexdowad commented Nov 9, 2020

Uh oh!

Uh oh!

alexdowad commented Nov 6, 2020 •

edited

Loading