Major overhaul of mbstring (part 3) #6416

alexdowad · 2020-11-09T21:06:12Z

A continuation of #6052 and #6402.

These are mostly fixes for Shift-JIS-2004 and MacJapanese (SJIS-mac), along with a couple other things.

@nikic... FYA

ext/mbstring/libmbfl/filters/mbfilter_gb18030.c

ext/mbstring/libmbfl/filters/mbfilter_sjis_2004.c

nikic · 2020-11-10T09:55:04Z

ext/mbstring/tests/sjis2004_encoding.phpt

+			/* Two input byte sequences can translate to either a 'halfwidth' or a
+			 * 'fullwidth' version of a character; our implementation of SJIS-2004
+			 * translates them to the fullwidth versions */
+			if (preg_match('/Fullwidth: U\+([0-9A-F]+)/', $line, $match))


As this only affects two codepoints, maybe use the generic parsing function and overwrite them explicitly, as you did for other encodings? That would allow us to drop the comments from the input file, which is quite large (300 KB).

Personally, I like the fact that we use unmodified, "official" conversion tables as the basis for the tests. But if that is what you want, no sweat.

Just say "do it".

TBH I'm not sure on this one. Just a bit worried about bloating the distribution with large test-only data files. At least these should gzip well, so maybe I'm worried for nothing.

👍🏻 for thinking about the impact on all PHP core developers, not just on mbstring.

We could gzdeflate the file before parsing it. It would mean that zlib has to be enabled to run the mbstring tests. It would also mean that if the files are ever modified (which is not likely) we wouldn't get nice diffs.

I believe that git stores committed files in a compressed format, so the size of the repo may not increase that much either way. Of course, the size of a checkout will be a bit bigger.

Whichever way you prefer is fine.

ext/mbstring/tests/sjis2004_encoding.phpt

ext/mbstring/mbstring_singlebyte.h

ext/mbstring/mbstring_singlebyte.c

Previously, the unit tests for these text encodings covered all mappings from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy. However, we should also test the few Unicode -> legacy mappings which are not reversible.

Each flush function in a chain of mbstring conversion filters always calls the next flush function in the chain. So it is not necessary to explicitly flush the second filter in a chain. (Due to this bug, in many cases, flush functions were actually being called three times.)

This faulty binary search would never reject values at the very high end of the range being searched, even if they were not actually in the table. Among other things, this meant that some Unicode codepoints which do not correspond to any character in JIS X 0213 would be converted to bogus Shift-JIS-2004 values rather than being rejected.

… for Kanji correctly If the 2nd byte of a 2-byte character is invalid, then mb_substitute_character() should be respected. Instead, what mbstring was doing was 'swallowing' the first byte, then emitting the 2nd byte as if it was an ASCII character. Likewise, if the 2nd byte is missing, instead of just keeping quiet, report an illegal character as specified by mb_substitute_character().

… in SJIS-2004 Unicode has 'combining' characters which join with another following character. Japanese hiragana and katakana with the 'two dots' voice mark can be represented in this way, with one Unicode character for the 'base' kana and another one which adds the voice mark. In SJIS-2004, however, there are dedicated characters for voiced and unvoiced kana. So some special checks are done to identify sequences of Unicode characters which need to be 'collapsed' into a single SJIS-2004 character. If a kana is immediately followed by some other unrelated character, like a Cyrillic letter, then the cached kana should be output 'as is' and we proceed with encoding the unrelated character. When doing this, though, we need to re-initialize local variables, or else the unrelated character will be mangled in some cases.

…for Kanji correctly Also, don't accept 1st bytes above 0xED, since none of the possible 2-byte sequences starting with 0xEE and above are actually mapped to any character.

Since 1993, Unicode has had a specific codepoint for a fullwidth Yen sign. Likewise, MacJapanese has separate kuten codes for halfwidth and fullwidth Yen signs. But mbstring mapped _both_ Yen sign codepoints to the MacJapanese fullwidth Yen sign. It's probably more appropriate to map the 'ordinary' Yen sign to the MacJapanese halfwidth Yen sign. Besides, this means that the conversion between Unicode and MacJapanese is closer to being lossless and reversible.

…depoints When converting Unicode to MacJapanese, some special sequences of Unicode codepoints are collapsed into a single SJIS character. When the implementation sees a codepoint which *might* begin such a sequence, it is cached and examined again after the next codepoint arrives. If it turns out that it wasn't one of the 'special' sequences, then a 'fallback' conversion table is consulted to convert the cached codepoint. Then we re-enter the regular conversion code to convert the immediately following codepoint. BUT, local variables need to be reinitialized properly when doing this! Because the locals weren't reinitialized, the sad result was that some codepoints would get chopped up into bit salad and emitted as something totally bogus (which might not even be valid SJIS-mac text at all).

…a bad transcoding hint To give the background on this issue, here is an excerpt from JAPANESE.txt, from the Unicode Consortium: Apple has defined a block of 32 corporate characters as "transcoding hints." These are used in combination with standard Unicode characters to force them to be treated in a special way for mapping to other encodings; they have no other effect. Sixteen of these transcoding hints are "grouping hints" - they indicate that the next 2-4 Unicode characters should be treated as a single entity for transcoding. The other sixteen transcoding hints are "variant tags" - they are like combining characters, and can follow a standard Unicode (or a sequence consisting of a base character and other combining characters) to cause it to be treated in a special way for transcoding. These always terminate a combining-character sequence. The transcoding coding hints used in this mapping table are: 0xF860 group next 2 characters as a single entity for transcoding 0xF861 group next 3 characters as a single entity for transcoding 0xF862 group next 4 characters as a single entity for transcoding 0xF87A variant tag for "negative" (i.e. black & white reversed) 0xF87E variant tag for vertical form 0xF87F variant tag for other alternate form For example, the Apple addition character 0x85AB is Roman numeral thirteen. There is no single Unicode for this (although there are standard Unicodes for Roman numerals 1-12). Using the grouping hint 0xF862 in combination with standard Unicodes, we can map this as 0xF862+0x0058+0x0049+0x0049+0x0049 (i.e. X + I + I + I). Our SJIS-mac conversion code actually recognizes some special sequences which start with an Apple 'transcoding hint'. However, if a transcoding hint is misplaced and is not followed by one of the expected sequences, we can just emit one error marker for the bad transcoding hint and then process the following codepoint as normal.

For consistency with UTF-16 and UCS-4. Also, do some code cleanup.

- For consistency with UTF-16, UTF-32, and UCS-4, strip leading byte order marks. - Treat it as an error if string is truncated (i.e. has an odd number of bytes).

alexdowad · 2020-11-11T07:20:31Z

Just pushing an update.

As always, thanks for the great review!

We can squeeze out a lot of duplicated code in this way.

alexdowad · 2020-11-11T09:24:16Z

Landed.

alexdowad force-pushed the cleanup-mbstring-3 branch from c5a81c2 to 469f2bf Compare November 10, 2020 04:03

nikic reviewed Nov 10, 2020

View reviewed changes

alexdowad added 13 commits November 11, 2020 09:00

Test EUC-JP and Shift-JIS more thoroughly

8cd6069

Previously, the unit tests for these text encodings covered all mappings from legacy -> Unicode, and all _reversible_ mappings from Unicode -> legacy. However, we should also test the few Unicode -> legacy mappings which are not reversible.

Add test suite for SJIS-2004 encoding

673d917

SJIS-mac encoding conversion: handle invalid (or truncated) 2nd byte …

6309d19

…for Kanji correctly Also, don't accept 1st bytes above 0xED, since none of the possible 2-byte sequences starting with 0xEE and above are actually mapped to any character.

Add test suite for SJIS-mac encoding

f73ad21

Leading BOM is stripped for UTF-32

7ae459f

For consistency with UTF-16 and UCS-4. Also, do some code cleanup.

Enhance mbstring support for UCS-2 text

3a3b717

- For consistency with UTF-16, UTF-32, and UCS-4, strip leading byte order marks. - Treat it as an error if string is truncated (i.e. has an odd number of bytes).

alexdowad force-pushed the cleanup-mbstring-3 branch from 469f2bf to a513aa1 Compare November 11, 2020 07:20

Consolidate all single-byte encodings in one source file

dec3eca

We can squeeze out a lot of duplicated code in this way.

alexdowad force-pushed the cleanup-mbstring-3 branch from a513aa1 to dec3eca Compare November 11, 2020 07:30

nikic approved these changes Nov 11, 2020

View reviewed changes

alexdowad closed this Nov 11, 2020

alexdowad mentioned this pull request Nov 16, 2020

Major overhaul of mbstring (part 4) #6430

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Major overhaul of mbstring (part 3) #6416

Major overhaul of mbstring (part 3) #6416

Uh oh!

alexdowad commented Nov 9, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nikic Nov 10, 2020

Uh oh!

alexdowad Nov 10, 2020

Uh oh!

nikic Nov 10, 2020

Uh oh!

alexdowad Nov 11, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Nov 11, 2020

Uh oh!

alexdowad commented Nov 11, 2020

Uh oh!

Uh oh!

Major overhaul of mbstring (part 3) #6416

Major overhaul of mbstring (part 3) #6416

Uh oh!

Conversation

alexdowad commented Nov 9, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nikic Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

alexdowad Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

nikic Nov 10, 2020

Choose a reason for hiding this comment

Uh oh!

alexdowad Nov 11, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Nov 11, 2020

Uh oh!

alexdowad commented Nov 11, 2020

Uh oh!

Uh oh!