Major overhaul of mbstring (part 4) #6430

alexdowad · 2020-11-16T20:38:49Z

Continuation of #6416.

Fixes for CP932, CP51932, and Japanese mobile vendor-specific variants of Shift-JIS. Some simplifications, and also a few adjustments to mappings between legacy Japanese encodings and Unicode.

- Don't allow control characters to appear in the middle of a multi-byte character. (This was a strange feature of mbstring; it doesn't make much sense, and iconv doesn't allow it.)
- Treat truncated multi-byte characters as an error.
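The two rules above can be sketched as a small validator for a Shift-JIS-style byte stream. This is only an illustration, not the filter code from this PR: the `sketch_` names and status codes are hypothetical, and the lead/trail byte ranges are the usual ones for Shift-JIS and CP932. Single bytes are not checked further.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch; they are not mbstring's. */
enum sketch_status { SKETCH_OK, SKETCH_BAD_TRAIL, SKETCH_TRUNCATED };

/* Lead bytes of a two-byte character in Shift-JIS-style encodings. */
static int sketch_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Trail bytes are 0x40..0x7E and 0x80..0xFC; control characters never qualify. */
static int sketch_is_trail(unsigned char c)
{
    return c >= 0x40 && c <= 0xFC && c != 0x7F;
}

/* Scan a buffer and report the first structural problem, if any. */
enum sketch_status sketch_validate(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if (!sketch_is_lead(s[i])) {
            i++;                       /* single-byte character; not checked further here */
            continue;
        }
        if (i + 1 == len)
            return SKETCH_TRUNCATED;   /* lead byte at the very end of the input */
        if (!sketch_is_trail(s[i + 1]))
            return SKETCH_BAD_TRAIL;   /* e.g. a control character mid-character */
        i += 2;
    }
    return SKETCH_OK;
}
```

For example, the bytes `0x82 0x0A` (a lead byte followed by a line feed) would be reported as `SKETCH_BAD_TRAIL` rather than passed through.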

Many Japanese encodings, such as JIS7/8, Shift JIS, ISO-2022-JP, EUC-JP, and so on encode characters from the JIS X 0208 character set. JIS X 0208 is based on the concept of a 94x94 table, with numbered rows and columns. However, more than a thousand of the cells in that table are empty; JIS X 0208 does not actually use all 94x94=8,836 possible kuten codes.

mbstring had a dubious feature whereby, if a Japanese string contained one of these 'unmapped' kuten codes, and it was being converted to another Japanese encoding which was also based on JIS X 0208, the non-existent character would be silently passed through, and the unmapped kuten code would be re-encoded using the normal encoding method of the target text encoding. Again, this _only_ happened if converting the text with the funky kuten code to a Japanese encoding. If one tried converting it to Unicode, mbstring would treat that as an error.

If somebody, somewhere, made their own private extension to JIS X 0208, and used the regular Japanese encodings like Shift JIS and EUC-JP to encode this private character set, then this feature might conceivably be useful. But how likely is that? If someone is using Shift JIS, EUC-JP, ISO-2022-JP, etc. to encode a funky version of JIS X 0208 with extra characters added, then that should be treated as a separate text encoding.

The code which flags such characters with MBFL_WCSPLANE_JIS0208 is retained solely for error reporting in `mbfl_filt_conv_illegal_output`.
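To make the 'normal encoding method' concrete, here is a minimal sketch of how a kuten pair is turned into bytes for EUC-JP and Shift JIS. The function names are hypothetical and this is not the libmbfl code. Note that the arithmetic works for any cell of the 94x94 table, mapped or not, which is why an unmapped kuten code could mechanically be passed from one Japanese encoding to another.

```c
/* Flat index of a kuten cell in the 94x94 table (0..8835), or -1 if out of range. */
int sketch_kuten_index(int ku, int ten)
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94)
        return -1;
    return (ku - 1) * 94 + (ten - 1);
}

/* EUC-JP: each of ku and ten is offset by 0xA0. Returns 1 on success, 0 on bad input. */
int sketch_kuten_to_eucjp(int ku, int ten, unsigned char out[2])
{
    if (sketch_kuten_index(ku, ten) < 0)
        return 0;
    out[0] = (unsigned char)(ku + 0xA0);
    out[1] = (unsigned char)(ten + 0xA0);
    return 1;
}

/* Shift JIS: two rows share one lead byte; the trail byte tells them apart. */
int sketch_kuten_to_sjis(int ku, int ten, unsigned char out[2])
{
    if (sketch_kuten_index(ku, ten) < 0)
        return 0;
    /* Rows 1..62 use lead bytes 0x81..0x9F, rows 63..94 use 0xE0..0xEF. */
    out[0] = (unsigned char)((ku + 1) / 2 + (ku <= 62 ? 0x80 : 0xC0));
    if (ku % 2 == 1) {
        /* Odd row: trail bytes 0x40..0x7E, then 0x80..0x9E (0x7F is skipped). */
        out[1] = (unsigned char)(ten + (ten <= 63 ? 0x3F : 0x40));
    } else {
        /* Even row: trail bytes 0x9F..0xFC. */
        out[1] = (unsigned char)(ten + 0x9E);
    }
    return 1;
}
```

For instance, kuten 4-2 (HIRAGANA LETTER A) comes out as `0xA4 0xA2` in EUC-JP and `0x82 0xA0` in Shift JIS.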

…rs through

Similarly to JIS X 0208, mbstring would pass kuten codes which are not mapped in the JIS X 0212, JIS X 0213, or CP932 character sets through silently when converting to another Japanese encoding.

These flags identify text encodings in mbstring which use a constant number of bytes per character. While some parts of the code do use these flags, usually to detect cases which can be optimized due to constant-width encoding, nothing cares whether the encodings are 'LE' (little-endian) or 'BE' (big-endian). So we can simplify things by combining constants.

These constants indicate that a text encoding uses 2+ bytes for each character, and is either big endian or little endian (respectively). But nothing in mbstring cares about the difference between MBFL_ENCTYPE_MWC2BE and MBFL_ENCTYPE_MWC2LE. (Actually, nothing cares about whether these flags are set at all... maybe we should just remove them?)

…ants of Shift-JIS)

Lots of problems here.

- Don't pass 'control' characters through silently in the middle of a multi-byte character.
- Treat it as an error if a multi-byte character is truncated.
- For ESC sequences used to encode emoji on earlier Softbank phones, if an invalid ESC sequence is found, don't pass it through. Rather, handle it as an error and respect `mb_substitute_character`.
- In ranges used by mobile vendors for emoji, if a certain byte sequence doesn't map to any emoji, don't emit a mangled value (actually a raw (ku*94)+ten value, which may not even be a valid Unicode codepoint at all).
- When converting Unicode to SJIS-Mobile, don't mangle codepoints which fall in the 2nd range of Microsoft vendor extensions.

Some vendor-specific emoji have been mapped to standard Unicode codepoints now, rather than 'private use area' codepoints. When the legacy code was written, these codepoints may not have existed yet in the Unicode standard which was current at that time.

Also do a major code cleanup -- remove dead code, rearrange what is left, use some new macros and helper functions to make the code clearer...
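The point about unmapped emoji cells can be illustrated with a small lookup. Everything here is hypothetical (the range start, the three-entry table, and the `SKETCH_NO_MAPPING` marker); the real vendor tables are much larger and laid out differently. The sketch only shows the principle: a hole in the table yields an error marker instead of the raw (ku*94)+ten value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker meaning "no valid mapping; report a conversion error". */
#define SKETCH_NO_MAPPING UINT32_MAX

/* Hypothetical start of a vendor emoji range, as a raw (ku*94)+ten value. */
#define SKETCH_EMOJI_MIN 1000

/* Hypothetical table; 0 marks a cell with no emoji assigned. */
static const uint32_t sketch_emoji_table[] = {
    0x2600,  /* BLACK SUN WITH RAYS */
    0,       /* hole: nothing assigned here */
    0x2601,  /* CLOUD */
};
#define SKETCH_EMOJI_COUNT \
    (sizeof(sketch_emoji_table) / sizeof(sketch_emoji_table[0]))

uint32_t sketch_lookup_emoji(int ku, int ten)
{
    int s = ku * 94 + ten;  /* raw value, as described in the commit message */

    if (s < SKETCH_EMOJI_MIN || s >= SKETCH_EMOJI_MIN + (int)SKETCH_EMOJI_COUNT)
        return SKETCH_NO_MAPPING;

    uint32_t w = sketch_emoji_table[s - SKETCH_EMOJI_MIN];
    /* Never fall back to returning `s` itself: it is not a Unicode mapping. */
    return w != 0 ? w : SKETCH_NO_MAPPING;
}
```

A caller would treat `SKETCH_NO_MAPPING` as a conversion error, so that the configured substitute character is used.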

- Don't pass 'control' characters through in the middle of a multi-byte char
- Treat truncated multi-byte characters as an error

Shift-JIS-2004 is an extension of Shift-JIS, which uses 0x5C for the Yen sign. Therefore, it is not correct to convert ASCII 0x5C (backslash) to Shift-JIS-2004 0x5C (yen sign). JIS X 0208 does have a backslash, so we can convert ASCII backslash to SJIS-2004 backslash instead.

From time immemorial, there has been confusion around the treatment of 0x5C bytes on systems using legacy Japanese encodings. JIS X 0201 specified that 0x5C means a yen sign, and thus fonts on Japanese systems, including early versions of Windows, displayed a 0x5C byte as a yen sign. This meant that when ASCII text files were displayed on such systems, what were meant to be backslashes would appear as yen signs. Japanese C programmers could write character escapes using yen signs, and C compilers built on the assumption that the input was ASCII would interpret these escapes as desired. Likewise for shell scripts. Et cetera, et cetera...

Therefore, if the input to `mb_convert_encoding` is (for example) a C program, and after converting to Shift-JIS-2004, the user wishes to feed the output into a C compiler, *then* perhaps ASCII 0x5C should be mapped to SJIS 0x5C. However, this scenario is ridiculous and will never happen.

A more realistic scenario might be: an article written in SJIS-2004 has embedded Windows file paths (like 'C:\Program Files'), with yen signs used as a path separator. If we convert SJIS-2004 0x5C to ASCII 0x5C, then the path separators will be 'fixed' by the conversion. For general written texts, it is much better to convert backslashes to... backslashes. And yen signs, to yen signs.
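As a rough sketch of the mapping argued for here, covering only the two codepoints in question: the function names are hypothetical, and the two-byte backslash is assumed to be the JIS X 0208 character at kuten 1-32, which Shift-JIS arithmetic places at `0x81 0x5F`. The actual filter goes through conversion tables rather than special cases like this.

```c
#include <stdint.h>

/*
 * Encode one codepoint to Shift-JIS-2004. Returns the number of bytes written,
 * or 0 when the codepoint is outside what this sketch covers.
 */
int sketch_sjis2004_encode(uint32_t cp, unsigned char out[2])
{
    if (cp == 0x005C) {      /* REVERSE SOLIDUS: use the two-byte backslash */
        out[0] = 0x81;
        out[1] = 0x5F;       /* assumed: JIS X 0208 kuten 1-32 */
        return 2;
    }
    if (cp == 0x00A5) {      /* YEN SIGN: single byte 0x5C */
        out[0] = 0x5C;
        return 1;
    }
    return 0;
}

/* Decode the same two cases. Returns 0 for anything this sketch does not cover. */
uint32_t sketch_sjis2004_decode(const unsigned char *in, int len)
{
    if (len == 1 && in[0] == 0x5C)
        return 0x00A5;       /* 0x5C is a yen sign in this encoding */
    if (len == 2 && in[0] == 0x81 && in[1] == 0x5F)
        return 0x005C;       /* two-byte backslash */
    return 0;
}
```

With this mapping, both characters survive a round trip through Shift-JIS-2004 unchanged.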

When Microsoft created CP932 (their version of Shift-JIS), they explicitly used bytes 0-0x7F to represent ASCII characters rather than JIS X 0201 characters. So when converting Unicode to CP932, it is not correct to convert U+00A5 to CP932 0x5C. Fortunately, CP932 does have a multi-byte FULLWIDTH YEN SIGN character which we can use instead. CP51932 uses the same extended character set as CP932; while CP932 is Microsoft's extended version of Shift-JIS, CP51932 is their extended version of EUC-JP. So the same reasoning applies to CP51932.
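A minimal sketch of that rule for both encodings follows. The names are hypothetical, and the two-byte sequences assume the fullwidth yen sign sits at kuten 1-79, which gives `0x81 0x8F` in the Shift-JIS layout and `0xA1 0xEF` in the EUC layout. The real filters use conversion tables rather than explicit cases.

```c
#include <stdint.h>

/* Hypothetical selector for the two Microsoft encodings discussed here. */
enum sketch_ms_encoding { SKETCH_CP932, SKETCH_CP51932 };

/*
 * Encode one codepoint. Returns the number of bytes written, or 0 when the
 * codepoint is outside what this sketch covers.
 */
int sketch_ms_encode(enum sketch_ms_encoding enc, uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {
        /* Bytes 0x00-0x7F are plain ASCII, so 0x5C stays a backslash. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp == 0x00A5 || cp == 0xFFE5) {
        /* YEN SIGN is sent to the multi-byte FULLWIDTH YEN SIGN, never to 0x5C. */
        if (enc == SKETCH_CP932) {
            out[0] = 0x81;
            out[1] = 0x8F;
        } else {
            out[0] = 0xA1;
            out[1] = 0xEF;
        }
        return 2;
    }
    return 0;
}
```

Under this scheme a backslash and a yen sign in the input never collapse onto the same output byte.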

…ariants

Converting U+203E to 0x7E was especially wrong for CP932, where 0x7E represents a tilde. For vanilla Shift-JIS and Shift-JIS-2004, converting to 0x7E is acceptable, since 0x7E does represent an overline/macron in those encodings. Follow the same principle in CP51932, which is closely related to CP932.

By entering this character in the JIS X 0208 conversion table, we can remove a bunch of explicit `if` clauses in different conversion filters. It also means that U+FF5E can be converted into SJIS-mac now; I don't know why this one SJIS variant rejected U+FF5E before, since 0x8160 means the same thing in SJIS-mac as the others.

…iants

Except for vanilla Shift-JIS, where 0x7E is a halfwidth overline/macron. As for Shift-JIS-2004, it has an added character (byte sequence 0x854A) which was defined as a halfwidth macron in JIS X 0213:2000, so we use that.

nikic

Looks good to me ... emphasis on "looks". I have no idea whether these mapping changes are really "correct", but you seem to be making a good case for them :) These encodings are really something.

ext/mbstring/libmbfl/filters/mbfilter_sjis_mobile.c

alexdowad · 2020-11-25T18:54:28Z

Landed.

alexdowad added 14 commits November 16, 2020 22:19

Bugfixes for findInvalidChars (helper for mbstring test suite)

3595b0b

Enhance handling of CP932 text encoding

14cb329

Don't pass invalid JIS X 0212, JIS X 0213, and Windows-CP932 characte…

bd91751

Enhance handling of CP51932 encoding

22a64d7

0x7E is not a tilde in Shift-JIS{,-2004}

e2b5ea2

nikic approved these changes Nov 25, 2020

alexdowad closed this Nov 25, 2020
