Skip to content

Major overhaul of mbstring (part 2) #6402

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

alexdowad
Copy link
Contributor

I have been doing a lot of bug fixing and cleanup of mbstring over at #6052. However, I didn't anticipate how deep the mbstring rabbit hole is, and that PR has ballooned beyond all reasonable limits. Therefore, I am opening another one with some fixes and simplifications which can hopefully go in more quickly.

This PR aims to simplify mbstring by eliminating the use of mbfl_identify_filter (as discussed with @nikic elsewhere). A couple other things also had to be fixed for all the tests to pass after removing mbfl_identify_filter...

The removal of byte2/byte4 was not strictly necessary, but made the other work a bit easier.

I made some mistakes on this code, which meant that not everything which
should be tested was actually being tested.
There is no meaningful difference between these and UCS-{2,4}. They are
just a little bit more lax about passing errors silently. They also have
no known use.

Alias to UCS-{2,4} in case someone, somewhere is using them.
- Reject otherwise valid kuten codes which don't map to anything in JIS X 0208.
- Handle truncated multi-byte characters as an error.
- Convert Shift-JIS 0x7E to Unicode 0x203E (overline) as recommended by the
  Unicode Consortium, and as iconv does.
- Convert Shift-JIS 0x5C to Unicode 0xA5 (yen sign) as recommended by the
  Unicode Consortium, and as iconv does.
  (NOTE: This will affect PHP scripts which use an internal encoding of
  Shift-JIS! PHP assigns a special meaning to 0x5C, the backslash. For example,
  it is used for escapes in double-quoted strings. Mapping the Shift-JIS yen
  sign to the Unicode yen sign means the yen sign will not be usable for
  C escapes in double-quoted strings. Japanese PHP programmers who want to
  write their source code in Shift-JIS for some strange reason will have to
  use the JIS X 0208 backlash or 'REVERSE SOLIDUS' character for their C
  escapes.)
- Convert Unicode 0x5C (backslash) to Shift-JIS 0x815F (reverse solidus).
- Immediately handle error if first Shift-JIS byte is over 0xEF, rather than
  waiting to see the next byte. (Previously, the value used was 0xFC, which is
  the limit for the 2nd byte and not the 1st byte of a multi-byte character.)
- Don't allow 'control characters' to appear in the middle of a multi-byte
  character.

The test case for bug 47399 is now obsolete. That test assumed that a number
of Shift-JIS byte sequences which don't map to any character were 'valid'
(because the byte values were within the legal ranges).
- Don't allow control characters to appear in the middle of a multi-byte
  character. (A strange feature, or perhaps misfeature, of mbstring which is
  not present in other libraries such as iconv.)
- When checking whether string is valid, reject kuten codes which do not
  map to any character, whether converting from EUC-JP to another encoding,
  or converting another encoding which uses JIS X 0208/0212 charsets to
  EUC-JP.
- Truncated multi-byte characters are treated as an error.
mbstring had an 'identify filter' for almost every supported text encoding
which was used when auto-detecting the most likely encoding for a string.
It would run over the string and set a 'flag' if it saw anything which
did not appear likely to be the encoding in question.

One problem with this scheme was that encodings which merely appeared
less likely to be the correct one were completely rejected, even if there
was no better candidate. Another problem was that the 'identify filters'
had a huge amount of code duplication with the 'conversion filters'.

Eliminate the identify filters. Instead, when auto-detecting text
encoding, use conversion filters to see whether the input string is valid
in candidate encodings or not. At the same type, watch the type of
codepoints which the string decodes to and mark it as less likely if
non-printable characters (ESC, form feed, bell, etc.) or 'private use
area' codepoints are seen.

Interestingly, one old test case in which JIS text was misidentified
as UTF-8 (and this wrong behavior was enshrined in the test) was 'fixed'
and the JIS string is now auto-detected as JIS.
@alexdowad
Copy link
Contributor Author

Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..."

@KalleZ
Copy link
Member

KalleZ commented Nov 6, 2020

Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..."

cc @Girgias

@Girgias
Copy link
Member

Girgias commented Nov 6, 2020

Checked failure on Azure, it looks spurious. IMAP tests failed with an exception saying "mailbox already exists in ..."

Indeed, seems a still messed up the parallel testing

@alexdowad
Copy link
Contributor Author

alexdowad commented Nov 6, 2020

One thing which maybe should be called out here:

In Shift-JIS text encoding, 0x5C is used for a yen sign (not a backslash as in ASCII). The same is true of some other legacy Japanese text encodings. Because 0x5C, the ASCII backslash, has special meaning in some contexts (such as some command shells or the C programming language), Japanese programmers may use yen signs in those same contexts.

However, the Unicode codepoint for a yen sign is not 0x5C but rather 0xA5. To correctly convert Shift-JIS to Unicode, SJIS 0x5C should be converted to Unicode 0xA5, rather than Unicode 0x5C (as mbstring currently does).

Since the Zend engine uses mbstring for internationalization of PHP code, fixing this bug means that some PHP programs written in Shift-JIS-encoded text may not work as expected. (The yen signs will no longer be treated as 'backslashes' for things like C escapes in double-quoted strings.)

Now, the JIS X 0208 character set does include a backslash character, and we do convert it to Unicode 0x5C, so that can still be used. So Japanese programmers using legacy text encodings for PHP coding are not left out in the cold; but it might be a surprise if someone finds an old PHP script doesn't work the same as it used to.

Copy link
Member

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

I think the only thing that's missing here is some more test coverage for whether the encoding detection works, because I'm not sure whether the current heuristic is sufficient. Simple thing to test would be take some small text samples in different languages, encode them in all encodings where they can be encoded, and then check whether the detected encoding comes back the same. Right now, we only seem to be testing a couple of Japanese encodings.

@alexdowad
Copy link
Contributor Author

Looks great!

I think the only thing that's missing here is some more test coverage for whether the encoding detection works, because I'm not sure whether the current heuristic is sufficient. Simple thing to test would be take some small text samples in different languages, encode them in all encodings where they can be encoded, and then check whether the detected encoding comes back the same. Right now, we only seem to be testing a couple of Japanese encodings.

Definitely, exhaustive testing of mb_detect_encoding is coming. Actually, not just mb_detect_encoding but all API functions exposed to PHP code will be 'torture tested' and any bugs discovered will be fixed. I fully expect that there will be more work on encoding detection at that time; for now, the fact that all existing tests pass shows that the new implementation works at least as well as the old one did.

For now, my priority is to put rigorous testing in place for mb_check_encoding and mb_convert_encoding for all supported text encodings.

@alexdowad
Copy link
Contributor Author

Landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants